The need for higher yielding and better-adapted crop plants for feeding the world's rapidly growing 
population has raised the question of how to systematically utilize large genebank collections with their wide 
range of largely untouched genetic diversity. Phenotypic data that have been recorded for decades during 
various rounds of seed multiplication provide a rich source of information. Their usefulness has remained 
limited, though, owing to various biases induced by conservation management over time or changing 
environmental conditions. Here, we present a powerful procedure that permits an unbiased trait-based 
selection of plant samples based on such phenotypic data. Applying this technique to the wheat collection of 
one of the largest genebanks worldwide, we identified groups of plant samples displaying contrasting 
phenotypes for selected traits. As a proof of concept for our discovery pipeline, we resequenced the entire 
major but conserved flowering time locus Ppd-D1 in just a few such selected wheat samples, and nearly 
doubled the number of hitherto known alleles. 



Climate change and the rapidly growing human population trigger the demand for significantly improved 
crop plants 1-7 . At present, new plant varieties are mainly developed through the reshuffling of alleles 
present in the elite gene pool (crossing the best, hoping for the best), resulting in a more or less constant 
repertoire of alleles or even an erosion of genetic diversity. As a result, genetic gains have gradually slowed for 
most of the major crop species including wheat 8,9 . Consequently, the relative increase in grain yield of wheat 
(Triticum aestivum L.), the most dominant crop species with a global acreage of 217 million ha 
(http://faostat.fao.org/), has fallen short of the global population growth rate in many parts of the world. 

One of the most promising approaches to cope with this challenge is to valorize the rich genetic diversity 
conserved in large germplasm collections stored ex situ in genebanks for trait improvement by the introduction of 
novel alleles 10-13 . However, the identification of plant samples (commonly referred to as 'accessions' 
in a genebank context) with specific combinations of traits from genebank collections is inherently difficult, 
because regeneration intervals vary (due to improved storage conditions) and because 
collections were not regenerated in a systematic scheme (i.e. individual regeneration cycles did not comprise 
the same set of accessions). Furthermore, changes in conservation management over time (evolution of agricultural 
practices, equipment, storage and regeneration strategies, and genebank standards) impinge on the usefulness 
of phenotypic observations. Finally, environmental effects including inter-annual weather variability and 
long-term climate change affect plant phenology 4,14-18 . This is reflected by a shift of up to three weeks towards 
earlier flowering within the last 60 years, correlated with a significant increase of average spring temperatures over 
this period (Supplementary Fig. S1, Supplementary Fig. S2). As a result, technical and environmental effects 
lead to highly heterogeneous phenotypic data, which makes valorization of genetic resources a 
daunting task. 

Results and Discussion 

The normalized rank product is less biased than other descriptive 
statistics. Against this backdrop, we aimed at exploiting long-term 
phenotypic data to enable systematic access to large genebank 
collections for upcoming genome resequencing and genome-wide 
association studies (Fig. 1). We propose a statistical approach, 
which allows for the consolidation of data sets that were collected 
over sixty years of seed multiplication. The approach aims at the 
comparison of accessions grown in different years using a 
standardized scale. This is achieved by calculating the Normalized 
Rank Product (NRP) to order accessions for a specific trait under 
consideration (Supplementary Table S1). Thus, accessions grown 
across multiple decades can be compared for particular traits using 
a common scale ranging from 0 to 1. 

For validation, the NRP was compared against descriptive statistics 
like the mean, for ranking accessions based on historical data for 
flowering time. When this trait is analyzed using a 9-year sliding window 
for the annual median flowering time (Fig. 2A), it becomes obvious 
that the flowering time depends on the year of cultivation. More 
specifically, early flowering was observed predominantly in recent 
decades (after 1980), while late flowering was observed in early decades. 
Based on this observation, we compared a varying number of accessions 
with best mean and NRP values by examining the percentage of accessions 
that have been cultivated for the first time after 1980 (Fig. 2B). 
For the mean, the percentage initially is very high and converges with 
increasing number of selected accessions to the percentage of accessions 
entered into the collection after 1980, whereas the percentage 
fluctuates around the global value for the NRP. This observation indicates 
that the NRP is less influenced by the shift of flowering time. 
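
The trend line of Fig. 2A can be illustrated with a short sketch. It assumes that the 9-year window is centered on each year, is truncated at the edges of the record, and that the window statistic is itself a median of the annual medians; none of these details are stated in the text, and the function names are hypothetical.

```python
from statistics import median

def annual_medians(records):
    """records: iterable of (year, flowering_day_of_year) pairs."""
    by_year = {}
    for year, day in records:
        by_year.setdefault(year, []).append(day)
    return {year: median(days) for year, days in by_year.items()}

def sliding_median(yearly, window=9):
    """Median of the annual medians inside a centered window of `window` years.

    Assumption: years without data are skipped and the window is simply
    truncated at the start and end of the record.
    """
    half = window // 2
    trend = {}
    for year in sorted(yearly):
        values = [yearly[y] for y in range(year - half, year + half + 1) if y in yearly]
        trend[year] = median(values)
    return trend
```

A downward trend in the returned values over the years would correspond to the shift towards earlier flowering described above.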

To further validate the power of the NRP, two winter wheat accessions 
showing contradicting mean and NRP were scrutinized as an 
example. A direct comparison was possible because data from 2 
years of common cultivation (1998, 2000) were available. In both years, the 
accession with smaller NRP flowered earlier than the one with smaller 
mean flowering time (Fig. 2C), again confirming the accuracy of the 
NRP. Finally, mean and NRP were assessed on all pairs of accessions 
that have been cultivated together in at least two years as described in 
the specific example above. Based on this evaluation, the NRP was 
shown to be less biased than the mean (Supplementary Fig. S3). Thus, 
the NRP is a robust trait characteristic and can be used, for instance, 
for clustering and principal component analysis. As an example, 
every accession in the collection can be described by more than 
one normalized trait. 

Figure 1 | A novel strategy for valorizing genetic diversity stored in genebanks. Every year genebank curators select accessions for seed multiplication. 
These accessions are grown in field trials and their phenotypic observations are recorded during the growing period. Annual ranking of these data 
is the first step in NRP analysis which allows comparing accessions that have been cultivated in different years and under different conditions. This 
example visualizes a wheat accession cultivated four times since 1946. The histograms indicate the distributions for all accessions cultivated in a given year. 
Missing histograms indicate missing phenotypic observations. Based on NRPs, MTO allows identifying accessions with specific combinations of traits 
that can be utilized for targeted plant research and breeding. The black asterisk and red dot in the cube represent the best virtual and one real accession, 
respectively, that simultaneously have early flowering time, small plant height and high thousand grain weight. In the histograms of absolute trait values, 
the actual accession is indicated by red lines. This figure has been generated using Adobe Photoshop CC and Adobe Illustrator CC. 

Figure 2 | Validation of the NRP for flowering time of winter wheat. (A) Box plots and trend of flowering time (FTi, in days of year, ordinate) 
for winter wheat between 1946 and 2010 (abscissa). The trend in blue is the 9-year sliding window for annual median flowering time indicating a clear shift 
towards earlier flowering. (B) Comparison of NRP with the naive method of averaging the phenotypic observations (mean). The panel plots the 
percentage of accessions that have been cultivated for the first time after 1980 against the number of selected accessions. For the mean, the percentage is 
initially very high and converges to the global value. In contrast, the behavior of the NRP is almost constant. (C) Specific example of early flowering time 
for two winter wheat accessions. Green visualizes the accession TRI 7594 selected by NRP, while red visualizes the accession TRI 16575 selected by mean. 
Hence, the red accession has a smaller mean (149 < 153) but a higher NRP (0.51 > 0.06) than the green accession (Supplementary Table S1). However, 
the two common cultivations in 1998 and 2000 illustrate that the green accession typically flowers earlier than the red one. Plots were created with R 
(http://r-project.org). 



Multi-trait optimization identifies promising accessions for 
further evaluation. In Fig. 1, the results are presented for three 
traits, namely, thousand grain weight, flowering time, and plant 
height. In general, if one is interested in n different traits, one 
might span an n-dimensional hypercube. For the identification of 
promising accessions for further evaluation, we used a multi-trait 
optimization (MTO) approach based on the NRP to overcome an 
intrinsic structure caused by correlations between normalized traits 
and dependencies on passport data (Supplementary Text A, 
Supplementary Text B). 

This approach selects accessions which are located near the corners 
of the cube corresponding to the most extreme trait combinations 
(Fig. 1, Fig. 3A, Supplementary Fig. S4A). Hence, opposing 
corners represent contrasting phenotypes (Supplementary Fig. S4). 
For example, the corner (1,0,0) indicated by an asterisk in Fig. 1 refers 
to the combination of high thousand grain weight, early flowering, 
and short plant height, whereas the corner (0,1,1) refers to the 
contrasting combination of low thousand grain weight, late flowering, 
and tall plant height. Based on that, MTO allows for identifying 
accessions breaking the existing correlations between traits, i.e., 
having combinations of traits that only rarely occur in a given gene pool 
(cf. empty corners in Fig. 3A, Supplementary Fig. S5, Supplementary 
Fig. S6). 
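
The geometric idea behind corner-based selection can be sketched as follows. This is only an illustration: it assumes that closeness to a corner is measured by Euclidean distance in NRP space, which is not stated in the text (the actual optimization is described in the Supplementary Text), and the function name and accession labels are hypothetical.

```python
import math

def nearest_to_corner(nrps, corner, k):
    """Return the k accessions whose NRP vectors lie closest to `corner`.

    nrps:   dict mapping accession -> tuple of NRPs, one per trait
    corner: tuple of 0/1 targets, e.g. (1, 0, 0)
    Assumption: Euclidean distance; the paper's MTO may use another criterion.
    """
    ranked = sorted(nrps, key=lambda acc: math.dist(nrps[acc], corner))
    return ranked[:k]

# Hypothetical NRPs for (thousand grain weight, flowering time, plant height).
example = {
    "acc_A": (0.9, 0.1, 0.1),
    "acc_B": (0.5, 0.5, 0.5),
    "acc_C": (0.1, 0.9, 0.8),
}
```

Calling the function once per corner of interest yields the contrasting groups; opposing corners such as (1,0,0) and (0,1,1) return opposing phenotypes.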

To validate this MTO, we used long-term phenotypic data that 
were available for nearly 7,000 winter wheat accessions. For further 
validation, four groups of contrasting phenotypes were selected for 
extreme plant height and extreme flowering time (Supplementary 
Table S2) and grown in a field experiment. In total, 60 accessions (15 
accessions per contrasting group) were grown together, which 
allowed for the direct comparison of their phenotypes. Comparison 
with the legacy data revealed highly similar patterns, 
demonstrating the efficacy of MTO (Fig. 3B, Fig. 3C, Supplementary 
Table S2). 

Proof of concept: resequencing at Ppd-D1 locus nearly doubled 
the number of known alleles. To investigate the potential of trait-based 
selection for discovering new alleles, all 60 contrasting 
accessions obtained from MTO (Fig. 3A), represented by 96 
individual plants, were resequenced at the major photoperiod 
response locus Ppd-D1 (Supplementary Table S4). For this, we 
utilized sequence information from the International Wheat 
Genome Sequencing Consortium (IWGSC) 19-21 . Despite the small 
sample size and the high degree of conservation at this locus, three 
novel alleles were identified, nearly doubling the number of 
haplotypes known for bread wheat at this locus 22-25 (Supplementary 
Table S5, Supplementary Fig. S8). Two of the polymorphisms 
were located in coding regions upstream of the CCT domain leading 
to premature stop codons and thus most likely cause a loss of 
function (Supplementary Text C, Supplementary Dataset). 

In conclusion, we proposed a procedure to compare results from 
"non orthogonal" field trials which are a hallmark of the continuous 
reproduction schemes of genebank collections. The solution, illustrated 
here for wheat, may be applied to other crops and their wild 
relatives, making this strategy a standard approach to select for 
'wanted' phenotypic combinations and to 'widen' the breeding bottleneck. 
This study demonstrates that the enormous amount of 
phenotypic data, recorded in genebanks over decades, can indeed be 
used to select accessions with desired combinations of traits. 
Computational methods such as the NRP can be applied immediately 
and at almost zero cost, harvesting the investment of time, energy 
and money made in genebanks over generations and proving their 
continued value. 

Figure 3 | Validation of MTO for a wheat collection comprising 6,959 accessions (Supplementary Table S6). (A) depicts the results of the MTO 
for the normalized traits flowering time (NRP FTi) and plant height (NRP PH) selecting four contrasting groups with 15 accessions each. 
(B) and (C) compare the phenotyping results (minimum, mean and maximum of pre-normalized values) of the field experiment 2010/2011 and the 
historical phenotypic data (1946-2009) for plant height and flowering time, respectively. In all three panels, the colors encode the selected contrasting 
groups, where black represents short and early flowering accessions, red represents tall and early flowering, green represents short and late flowering, and 
blue represents tall and late flowering (Supplementary Fig. S2). 

Based on NRPs and MTO of long-term phenotypic data, it is 
feasible to accurately assess the genotypic diversity and identify 
divergent accessions from comprehensive collections of any size, 
thereby avoiding the limitations entailed by the use of "core 
collections" 26 . Additional information including genotypic and pedigree 
information, if available, might be used to even better assess the 
genotypic diversity. Identification of a small set of contrasting 
phenotypes for flowering time immediately allowed for tapping into a 
substantial amount of novel allelic diversity at a major flowering time 
locus in wheat, which can be characterized and validated in future 
studies. Upcoming genome sequences for major cereal crops and 
new techniques including whole genome exome capture 19-21,27 will 
allow for a much more extensive association of phenotypic and allelic 
diversity and thus help to bridge the genotype to phenotype gap. 

Methods 

Federal genebank of Germany. The Federal ex situ Genebank for Agricultural and 
Horticultural Plant Species of Germany maintained at the Leibniz Institute of Plant 
Genetics and Crop Plant Research (IPK) in Gatersleben hosts one of the largest 
barley and wheat germplasm collections in the world 28 . The collection was established 
at its present location in 1945, and seed multiplication has been continuously 
performed since 1946. 

The cultivation for seed multiplication took place every two to three years 
until 1976. After a cold storage facility was established in 1976, the average frequency 
of seed multiplication dropped drastically to every 10-30 years, and accessions have 
been selected for seed multiplication if either the germination rate or the amount of 
available seeds dropped below a critical threshold 29 . 

Characterization and evaluation data. For IPK's seed bank, Characterization and 
Evaluation (C & E) data comprises dates of various growth stages (e.g., sowing, 
emergence, heading, flowering, ripening, harvest), winter survival, lodging, infection 
scores for some of the most important fungal diseases, plant height, and 
thousand-grain weight. In this paper, we focused the analysis on three key traits, namely, days to 
flowering (FTi), plant height (PH), and thousand-grain weight (TGW) for wheat 
(Triticum aestivum L.) and barley (Hordeum vulgare L.). Spring and winter types 
(growth habit) are distinguished for both cereals (Note: in Central and Northern 
Europe, winter types are sown in autumn, from September until December, while spring 
types are sown in spring, from March until April). 

Preprocessing of C & E data. Since 1946 the C & E data have been recorded in field 
books during the annual seed multiplication (Note: Since 2011, Personal Digital 
Assistants (PDA) have been used for data recording in the field, making the repeated transfer 
of these data unnecessary). Subsequently, these data have been transferred to main 
card files. However, the data had to be digitized prior to computational analysis. Thus, 
the data had been manually transferred twice, namely from field books to main card 
files and then to the computer, increasing the potential of introducing errors. 

Hence we checked for spurious values in the recorded traits, and tested the 
temporal order of the recorded dates. The data was split into winter and spring types 
based on their classification of growth habit, commonly referred to as 'annuality' 
among genebanks, in IPK's Genebank Information System (GBIS, 
http://gbis.ipk-gatersleben.de). 

For the identification of potential errors, we subsequently performed an extreme 
outlier detection using three times the interquartile range. These outliers were 
checked against the field books and corrected wherever possible. To ensure that any 
accession provided at most one record per year, we checked for accessions with more 
than one record in any year. In case of an accession that has been recorded more than 
once in a year, the records were replaced by a single record with the median of the 
observations for the subsequent analysis. 
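
The two cleaning steps above can be sketched in a few lines. The sketch assumes that "three times the interquartile range" refers to fences placed 3 × IQR below the first and above the third quartile (the usual convention for extreme outliers) and uses the default quartile estimator of Python's `statistics.quantiles`; the text specifies neither. Function names and record layout are hypothetical.

```python
from statistics import median, quantiles

def extreme_outliers(values, factor=3.0):
    """Return values outside [Q1 - factor*IQR, Q3 + factor*IQR].

    Assumption: fences are measured from the quartiles; such values would then
    be checked against the field books rather than dropped automatically.
    """
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - factor * iqr, q3 + factor * iqr
    return [v for v in values if v < low or v > high]

def collapse_duplicates(records):
    """Replace multiple records of an accession in one year by their median.

    records: iterable of (accession, year, value) triples.
    Returns a dict mapping (accession, year) -> single value.
    """
    grouped = {}
    for accession, year, value in records:
        grouped.setdefault((accession, year), []).append(value)
    return {key: median(vals) for key, vals in grouped.items()}
```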

For further analysis, the values for plant height (recorded at growth stage Z70 in 
cm 30 ) and thousand grain weight (in g) were used directly, while we computed time 
periods in days for the flowering time. Thus flowering time is recorded as day of year, 
i.e., the difference between January 1st and date of flowering (Z65 30 ). In Tab. S6, we 
present the number of data records for barley and wheat classified according to their 
annuality. After outlier detection, each collection comprised several thousand 
accessions and up to five times more records. 
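
The conversion of a flowering date into a day of year can be sketched as below. The sketch counts January 1st as day 1, which is an assumption: the text only speaks of the "difference" to January 1st, which could also be read as counting from zero.

```python
from datetime import date

def day_of_year(flowering):
    """Day of year of a flowering date, with January 1st counted as day 1 (assumption)."""
    return (flowering - date(flowering.year, 1, 1)).days + 1
```

For example, 2 June 2010 maps to day 153, which is in the range of the flowering times reported in Fig. 2.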

Normalized rank product on C & E data. Due to the heterogeneity of the data, 
descriptive statistics cannot easily be utilized for the data analysis. Instead the 
recorded values of each trait in each year were ranked to overcome the strong effects 
of annual differences, for instance inter-annual climate variability, climate change or 
agricultural practice. Subsequently, we introduced the normalized rank product 
(NRP) that is based on the idea of the popular rank product analysis, originally 
proposed for selecting differentially expressed genes from microarray experiment 
data 31 . 

Here, we extend the rank product to maintain the relation to the geometric mean of 
the normalized ranks. The NRP of an accession $a$ for a specific trait, denoted by $NRP_a$, 
is defined as 

$$NRP_a = \sqrt[|Y_a|]{\prod_{y \in Y_a} \frac{r_{y,a}}{n_y}},$$

where $Y_a$ is the set of years in which accession $a$ has been cultivated 
and its trait has been recorded, $r_{y,a}$ is the rank of the trait value of accession $a$ in year $y$, 
and $n_y$ is the number of accessions with this trait recorded in year $y$. The NRP ranges 
between zero (exclusive) and one (inclusive), allowing for comparison between 
accessions, independent of their number of years of cultivation. The NRPs were 
computed for the three important traits: flowering time, plant height, and thousand 
grain weight. 

In order to avoid overestimating the influence of accessions that have been 
infrequently cultivated and exhibit interesting NRPs by chance, only accessions that 
have been cultivated and recorded in at least three different years were included in the 
present study. With this set of assumptions, we were able to compare large parts of the 
above mentioned collections comprising thousands of accessions. 
As an illustrative example, a particular wheat accession has been cultivated in 1946, 
1967 and 2009. This plant sample was the first to flower in 1946 and 2009 (rank 1), 
and the third in 1967 (rank 3). These ranks are divided by the number of all accessions 
of that species cultivated in each year. The geometric mean is taken over 1946, 1967 
and 2009 yielding an NRP for flowering time of this accession ranging between 0 (early 
flowering) and 1 (late flowering). 
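
The computation described in this section can be sketched as follows. The sketch follows the definition of the NRP as the geometric mean of the per-year normalized ranks and applies the three-year minimum; the handling of ties (average rank) and all names are assumptions, as the text does not specify them.

```python
import math

def normalized_ranks(values):
    """values: dict accession -> trait value for one year.

    Returns dict accession -> rank / n, with rank 1 for the smallest value.
    Assumption: tied values share their average rank.
    """
    ordered = sorted(values, key=values.get)
    n = len(ordered)
    ranks = {}
    i = 0
    while i < n:
        j = i
        while j + 1 < n and values[ordered[j + 1]] == values[ordered[i]]:
            j += 1
        average_rank = (i + 1 + j + 1) / 2
        for accession in ordered[i:j + 1]:
            ranks[accession] = average_rank / n
        i = j + 1
    return ranks

def nrp(trait_by_year, min_years=3):
    """trait_by_year: dict year -> dict accession -> trait value.

    Returns dict accession -> NRP for accessions recorded in >= min_years years.
    """
    per_accession = {}
    for values in trait_by_year.values():
        for accession, r in normalized_ranks(values).items():
            per_accession.setdefault(accession, []).append(r)
    return {
        accession: math.exp(sum(math.log(r) for r in rs) / len(rs))
        for accession, rs in per_accession.items()
        if len(rs) >= min_years
    }
```

With hypothetical counts of 4, 4 and 2 recorded accessions in 1946, 1967 and 2009, the accession from the example above would obtain (1/4 · 3/4 · 1/2)^(1/3) ≈ 0.45.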

Experimental validation under field conditions for winter wheat. Based on the 
NRP and multi-trait optimization (MTO), we were able to find plant samples with 
contrasting phenotypes, for example winter wheat with extreme flowering time and 
plant height (Fig. 3A). To validate the accuracy of this procedure, 60 winter wheat 
accessions (four groups with 15 accessions each) were selected and planted in a field 
experiment at Gatersleben (51° 49'N, 11° 16'E) during the growing season 
2010/2011. We used plots of size 1.5 m × 2.5 m separated by equally sized plots of winter 
barley and followed local agricultural practice. In the field experiment, we performed 
three replicates arranged in three blocks. Each block consisted of one plot per selected 
accession in random order. 
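
The block layout described above (three blocks, each containing every selected accession once in random order) can be sketched as below. The sketch does not model the interleaved winter barley plots, and the function name, seed parameter and accession labels are hypothetical.

```python
import random

def randomized_blocks(accessions, n_blocks=3, seed=None):
    """Return n_blocks lists, each an independent random ordering of all accessions."""
    rng = random.Random(seed)
    layout = []
    for _ in range(n_blocks):
        block = list(accessions)
        rng.shuffle(block)
        layout.append(block)
    return layout

# 60 hypothetical accession labels, one plot per accession and block.
selected = [f"acc{i:02d}" for i in range(60)]
layout = randomized_blocks(selected, seed=1)
```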

Each plot was planted at a density of 80 seeds per m². Subsequently, 10 individuals 
per plot (in total 1,800 individuals) were randomly selected and genetically purified, 
as genebank accessions are not expected to be homogenous and must be purified for 
genomic studies. Thus all 1,800 individuals were covered with bags to prevent cross 
pollination and phenotyped. The development of single seed descents (SSDs) is a 
procedure widely used in plant breeding 32 . All individuals were phenotyped for 
flowering time (Z60 30 ) as well as for plant height (Z70 30 ), and were finally harvested. 
Leaf samples for DNA isolation were taken from every plant at Z30 30 . 

Based on known geographical origin and pedigree information, a subset of 28 
accessions (out of 60 accessions mentioned above) was then subselected (6-8 
accessions per contrasting group) for an independent second field experiment in 
2011/2012 in order to validate the data from the experiment in 2010/2011. As in the year 
before, we used plots of size 1.5 m × 2.5 m separated by equally sized plots of winter 
barley and followed local agricultural practice. Three replicates were performed, 
arranged in three blocks. Each block consisted of one plot per selected accession in 
random order. Each plot was planted at a density of 80 seeds per m². Full plots were 
phenotyped for flowering time (Z60 30 ) as well as for plant height (Z70 30 , Fig. S7). 

Extraction of genomic DNA. Genomic DNA was isolated from silica-dried single 
leaves of each line with the Qiagen DNeasy Plant Mini Kit (Qiagen, Hilden, 
Germany), according to the manufacturer's instructions. 

Genotyping major vernalization genes for wheat. Wheat can be divided into spring 
and winter habit varieties along with a group of intermediate lines known as 
facultative varieties. The vernalization-sensitivity (winter habit) alleles, vrn-A1, 
vrn-B1 and vrn-D1 are located on chromosomes 5A, 5B and 5D, respectively 33,34 . Another 
vernalization gene, Vrn-3, located on chromosome 7B, was shown to encode a 
homolog of Arabidopsis FT, and was named VRN3 35 . 

Aiming at reducing any bias for flowering time analysis based on growth habit, in 
total, 60 purified individuals (1 to 3 plants per accession) were randomly selected 
among the 28 contrasting accessions subselected for the second field experiment (see 
above) and genotyped at four major vernalization loci (Vrn-A1, Vrn-B1, Vrn-D1 and 
Vrn-B3) using allele-specific molecular assays. Following the experimental 
protocols 22-24,35-40 (Tab. S3), all accessions were confirmed as true winter wheat harboring 
only vernalization-sensitive alleles, providing strong support for the flowering time 
analysis. 

Resequencing the major photoperiod response locus Ppd-D1. Sixty contrasting 
accessions (represented by 96 plants, 1 to 3 plants per accession) were resequenced at 
Ppd-D1. In order to develop wheat D genome specific primer combinations, sequence 
information was obtained from the International Wheat Genome Sequencing 
Consortium (IWGSC, http://www.wheatgenome.org/) and NCBI GenBank 
(DQ885766.1). 

The Primer3 online software (v. 0.4.0, http://frodo.wi.mit.edu/primer3/) 
was used to design primers. Oligonucleotides were purchased from Eurofins MWG 
Operon, Ebersberg, Germany. One c. 5920 bp genomic region completely covering 
Ppd-D1 was amplified by locus-specific PCR primers from genomic DNAs (Tab. S4). 
5'-UTR, exons, introns and 3'-UTR were analyzed, as start and stop codon of the 
reference (DQ885766.1) were located at alignment position 2150 and 5299, 
respectively. The exons were located at 2150 to 2321, 2439 to 2603, 2712 to 2845, 3041 
to 3196, 3648 to 3825, 3932 to 4349, 4432 to 5089, 5200 to 5301. Specificity and 
chromosomal localization of PCR products were confirmed by Nulli-tetrasomic (NT) 
lines (N2A-T2B, N2B-T2D and N2D-T2B) 41 (Fig. S8). 
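
The exon coordinates listed above can be turned into a small lookup that reports exon lengths and the region a given alignment position falls into. The sketch treats the coordinates as inclusive and assumes that the reference contains no alignment gaps within these intervals; both points are assumptions, and the names are hypothetical.

```python
# Exon boundaries (alignment positions, inclusive) as listed in the text.
EXONS = [(2150, 2321), (2439, 2603), (2712, 2845), (3041, 3196),
         (3648, 3825), (3932, 4349), (4432, 5089), (5200, 5301)]

def exon_lengths(exons=EXONS):
    """Length of each exon in alignment columns."""
    return [end - start + 1 for start, end in exons]

def region_of(position, exons=EXONS):
    """Classify an alignment position as upstream, exon N, intron N or downstream."""
    if position < exons[0][0]:
        return "upstream"
    if position > exons[-1][1]:
        return "downstream"
    for i, (start, end) in enumerate(exons, start=1):
        if start <= position <= end:
            return f"exon {i}"
        if position < start:
            return f"intron {i - 1}"
    raise AssertionError("unreachable for well-formed exon lists")
```

Under these assumptions the eight exons sum to 1,983 positions, a multiple of three, with the stop codon at position 5299 occupying the last three positions of exon 8.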

PCR amplification was performed in 20 µl reaction volume. Templates were 
purified and sequenced directly on both strands on an Applied Biosystems 
(Weiterstadt, Germany) ABI Prism 3730 xL sequencer using BigDye terminators as 
described in 42 . Their DNA sequences were determined using primers designed for 
amplification and internal primers (Tab. S4). 

Single nucleotide polymorphism (SNP) detection. DNA sequences were processed 
with AB DNA Sequencing Analysis Software 5.2 and later manually edited by 
Sequencher software v5.0 (Gene Codes Corp.). Sequence alignments were generated 
with MAFFT webserver (http://www.ebi.ac.uk/Tools/msa/mafft/) using default 
parameters except "perform ffts" which was set to "genafpair". Subsequently, the 
multiple sequence alignment was manually modified. After filtering out sequences with 
low quality regions, 67 sequences remained in the final alignment comprising c. 
5920 bp. The heterozygous state at the deletion in the promoter region was indicated 
by inserting poly-N. In close analogy, the heterozygous state for the transposable 
element in the first intron was indicated by inserting poly-N. 

Allelic haplotypes were defined by DNASP 5.10.01 as described in 42 (Tab. S5). All 
identified singletons were confirmed afterwards by two additional independent 
amplifications and sequencing. 